For the following exercises we will also use data from Gapminder, this time on life expectancy.

As usual, we first need to read in the data. You can copy, paste, and run the following code in your script.

library(readr)

gap_life <- read_csv("../data/gapminder/life_expectancy_years.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   country = col_character()
## )
## See spec(...) for full column specifications.

Again, the data are currently in wide format.

1

Select only the data for the 20th century, but this time use a helper function instead of specifying a range of columns.
The helper function you should use here is starts_with(). We also want to keep the country column.
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
gap_life %>% 
  select(country, starts_with("19"))
## # A tibble: 187 x 101
##    country `1900` `1901` `1902` `1903` `1904` `1905` `1906` `1907` `1908`
##    <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1 Afghan~   29.2   29.3   29.3   29.4   29.4   29.5   29.6   29.6   29.7
##  2 Albania   35.5   35.5   35.5   35.5   35.5   35.5   35.5   35.5   35.5
##  3 Algeria   30.1   30.2   30.3   31.3   25.3   28     29.5   29.4   29.3
##  4 Andorra   NA     NA     NA     NA     NA     NA     NA     NA     NA  
##  5 Angola    29.5   29.6   29.7   29.8   29.9   30     30.1   30.1   30.2
##  6 Antigu~   33.7   33.7   33.7   33.7   33.7   33.7   33.7   33.7   33.7
##  7 Argent~   36.6   37.2   37.8   38.3   38.9   39.5   40.2   41     41.7
##  8 Armenia   35.2   35.4   35.6   35.8   36.1   36.3   36.5   36.7   36.9
##  9 Austra~   50     50.5   51.1   51.6   52.1   52.7   53.2   53.7   54.3
## 10 Austria   41.5   42     41     40.1   40.7   41.3   42     42.6   43.2
## # ... with 177 more rows, and 91 more variables: `1909` <dbl>,
## #   `1910` <dbl>, `1911` <dbl>, `1912` <dbl>, `1913` <dbl>, `1914` <dbl>,
## #   `1915` <dbl>, `1916` <dbl>, `1917` <dbl>, `1918` <dbl>, `1919` <dbl>,
## #   `1920` <dbl>, `1921` <dbl>, `1922` <dbl>, `1923` <dbl>, `1924` <dbl>,
## #   `1925` <dbl>, `1926` <dbl>, `1927` <dbl>, `1928` <dbl>, `1929` <dbl>,
## #   `1930` <dbl>, `1931` <dbl>, `1932` <dbl>, `1933` <dbl>, `1934` <dbl>,
## #   `1935` <dbl>, `1936` <dbl>, `1937` <dbl>, `1938` <dbl>, `1939` <dbl>,
## #   `1940` <dbl>, `1941` <dbl>, `1942` <dbl>, `1943` <dbl>, `1944` <dbl>,
## #   `1945` <dbl>, `1946` <dbl>, `1947` <dbl>, `1948` <dbl>, `1949` <dbl>,
## #   `1950` <dbl>, `1951` <dbl>, `1952` <dbl>, `1953` <dbl>, `1954` <dbl>,
## #   `1955` <dbl>, `1956` <dbl>, `1957` <dbl>, `1958` <dbl>, `1959` <dbl>,
## #   `1960` <dbl>, `1961` <dbl>, `1962` <dbl>, `1963` <dbl>, `1964` <dbl>,
## #   `1965` <dbl>, `1966` <dbl>, `1967` <dbl>, `1968` <dbl>, `1969` <dbl>,
## #   `1970` <dbl>, `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>,
## #   `1975` <dbl>, `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>,
## #   `1980` <dbl>, `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>,
## #   `1985` <dbl>, `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>,
## #   `1990` <dbl>, `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>,
## #   `1995` <dbl>, `1996` <dbl>, `1997` <dbl>, `1998` <dbl>, `1999` <dbl>
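
To see what starts_with() matches without loading the full file, here is a minimal sketch on a small made-up tibble. The object toy_wide and its values are hypothetical and not taken from the Gapminder file. For comparison it also uses matches(), another selection helper, which takes a regular expression and is therefore stricter than a plain prefix match.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data in the same wide layout (values are made up)
toy_wide <- tibble(
  country = c("A", "B"),
  `1899` = c(30.1, 41.0),
  `1900` = c(30.2, 41.2),
  `1999` = c(70.5, 78.3),
  `2000` = c(70.9, 78.6)
)

# starts_with() matches on the literal prefix of the column name
by_prefix <- toy_wide %>%
  select(country, starts_with("19"))

# matches() uses a regular expression: exactly four digits, starting with 19
by_regex <- toy_wide %>%
  select(country, matches("^19[0-9]{2}$"))

names(by_prefix)
names(by_regex)
```

Both calls keep country, 1900, and 1999 here. The regular expression would only make a difference if the data contained other columns whose names happen to start with "19".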

As you may have already noticed, the dataset has some missing data points. Before we start analyzing the data, we might want to know for how many countries we have complete data.

2

Using the dataset in wide format, find out for how many countries we have complete data.
To answer this question you should use the drop_na() function from tidyr.
library(tidyr)

gap_life %>% 
  drop_na() %>% 
  nrow()
## [1] 184
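
The following sketch illustrates what drop_na() does row by row, using a small made-up tibble (toy_wide and its values are hypothetical). It also cross-checks the count with base R's complete.cases() and lists which rows would be dropped, which can be useful before discarding data.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical toy data: B and C each have one missing year
toy_wide <- tibble(
  country = c("A", "B", "C"),
  `1900` = c(30.2, NA, 45.0),
  `1901` = c(30.4, 41.5, NA)
)

# drop_na() without further arguments removes every row with at least one NA
n_complete <- toy_wide %>%
  drop_na() %>%
  nrow()

# Base R cross-check: complete.cases() flags rows without any NA
n_complete_base <- sum(complete.cases(toy_wide))

# Countries that would be dropped because of missing values
incomplete <- toy_wide %>%
  filter(!complete.cases(.)) %>%
  pull(country)

n_complete
incomplete
```

Because the data are in wide format, one row corresponds to one country, so the number of remaining rows equals the number of countries with complete data.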

As in the previous set of data wrangling exercises, we now want to transform the data into long format.

3

Transform the gap_life dataset into a sensible long format. Name the variable representing the values for life expectancy lifeExp and store the resulting dataframe in an object with the same name as before (gap_life).
This is just a repetition of the Tidy Data exercises. What we want to do is gather the year columns into a single year variable.
library(tidyr)

gap_life <- gap_life %>% 
  gather(-country, key = "year", value = "lifeExp")
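
In current versions of tidyr, gather() is superseded by pivot_longer(). As a sketch of how the same reshaping could look with the newer function, here is a comparison on made-up data (toy_wide and its values are hypothetical). The two functions return the same rows, but in a different order.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical toy data in wide format
toy_wide <- tibble(
  country = c("A", "B"),
  `1900` = c(30.2, 41.2),
  `1901` = c(30.4, 41.5)
)

# The approach used in this exercise
toy_long_gather <- toy_wide %>%
  gather(-country, key = "year", value = "lifeExp")

# The pivot_longer() equivalent: names_to/values_to replace key/value
toy_long_pivot <- toy_wide %>%
  pivot_longer(-country, names_to = "year", values_to = "lifeExp")

# gather() stacks column by column (all countries for 1900, then 1901);
# pivot_longer() goes row by row (all years for A, then B)
toy_long_gather
toy_long_pivot
```

In both cases the year column is of type character, because it is built from column names. It still needs to be converted before it can be used in numeric comparisons.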

Now let’s apply some of the advanced filtering options we discussed in the Data Wrangling - Part 2 session.

4

Create two new dataframes that include different subsets of the gap_life data: 1. data for all countries for the 1990s (name this one gap_life_1990s), 2. data for all years but only for Germany (name this one gap_life_ger). NB: There are different Germanies in the dataset: West Germany, East Germany, and Germany.
Before we can create the first new dataset, we need to ensure that the year variable is of the right type. To achieve this you can just copy, paste, and execute the following lines of code.
gap_life <- gap_life %>% 
  mutate(country = as.factor(country),
         year = as.integer(year))
## Warning: NAs introduced by coercion
You need to use a helper function from dplyr to create the first new dataframe and a specific matching operator to create the second one.
gap_life_1990s <- gap_life %>% 
  filter(between(year, 1990, 1999))

gap_life_1990s
## # A tibble: 0 x 3
## # ... with 3 variables: country <fct>, year <int>, lifeExp <chr>
gap_life_ger <- gap_life %>% 
  filter(country %in% 
           c("Germany", "West Germany", "East Germany"))

gap_life_ger
## # A tibble: 438 x 3
##    country  year lifeExp
##    <fct>   <int> <chr>  
##  1 Germany    NA 1800   
##  2 Germany    NA 1801   
##  3 Germany    NA 1802   
##  4 Germany    NA 1803   
##  5 Germany    NA 1804   
##  6 Germany    NA 1805   
##  7 Germany    NA 1806   
##  8 Germany    NA 1807   
##  9 Germany    NA 1808   
## 10 Germany    NA 1809   
## # ... with 428 more rows
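
The output shown above (a coercion warning, a year column that contains only NA, year-like values in a character lifeExp column, and an empty gap_life_1990s) is consistent with the reshaping step from exercise 3 having been run twice on the same object. This is an inference from the printed output, not something stated in the exercise. The sketch below uses made-up data (toy_wide and its values are hypothetical) to reproduce that pattern and to show what the two filters return when the data were reshaped only once.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical toy data in wide format
toy_wide <- tibble(
  country = c("Germany", "France"),
  `1989` = c(75.0, 76.0),
  `1990` = c(75.3, 76.4),
  `1999` = c(77.6, 78.5)
)

# Reshaped once: year holds the former column names, lifeExp the values
once <- toy_wide %>%
  gather(-country, key = "year", value = "lifeExp")

# Reshaped twice: the old year and lifeExp columns are stacked on top of
# each other, so the new year column only holds the two old column names
twice <- once %>%
  gather(-country, key = "year", value = "lifeExp")
unique(twice$year)

# With data reshaped once, the type conversion works without a warning
toy_long <- once %>%
  mutate(country = as.factor(country),
         year = as.integer(year))

# A cheap guard that catches the problem before any filtering
stopifnot(!anyNA(toy_long$year), is.numeric(toy_long$lifeExp))

# between() keeps values within the (inclusive) range
toy_1990s <- toy_long %>%
  filter(between(year, 1990, 1999))

# %in% keeps rows whose country matches any element of the vector
toy_ger <- toy_long %>%
  filter(country %in%
           c("Germany", "West Germany", "East Germany"))
```

If the guard fails on the real data, re-reading the CSV file and running the reshaping step exactly once should restore the expected structure.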

For some comparisons (especially via plots), it might help to know which continent the country is located on. For this purpose, we will create a new continent variable. As it would be quite tedious to create this variable manually for all of the countries in the dataset, we will do this only for a subset in this exercise. Just run the following code in your local script to create this subset.

gap_life_subset <- gap_life %>% 
  filter(country %in% 
           c("Netherlands", "Brazil", "China", "Algeria", "New Zealand"))

5

Create a continent variable for the countries in gap_life_subset. The variable should be a factor with the following values: Africa, Americas, Asia, Europe, Oceania.
You should use the case_when() function to create this new variable.
gap_life_subset %>% 
  mutate(continent = factor(case_when(
    country == "Algeria" ~ "Africa",
    country == "Brazil" ~ "Americas",
    country == "China" ~ "Asia",
    country == "Netherlands" ~ "Europe",
    country == "New Zealand" ~ "Oceania")
    ))
## # A tibble: 2,190 x 4
##    country      year lifeExp continent
##    <fct>       <int> <chr>   <fct>    
##  1 Algeria        NA 1800    Africa   
##  2 Brazil         NA 1800    Americas 
##  3 China          NA 1800    Asia     
##  4 Netherlands    NA 1800    Europe   
##  5 New Zealand    NA 1800    Oceania  
##  6 Algeria        NA 1801    Africa   
##  7 Brazil         NA 1801    Americas 
##  8 China          NA 1801    Asia     
##  9 Netherlands    NA 1801    Europe   
## 10 New Zealand    NA 1801    Oceania  
## # ... with 2,180 more rows
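
Two details of case_when() are easy to overlook: rows that match none of the conditions receive NA, and factor() orders its levels alphabetically unless told otherwise. The sketch below shows both on made-up data. The object toy_subset, its values, and the country "Atlantis" are hypothetical and only serve to illustrate an unmatched case.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data; "Atlantis" is made up and matches no condition
toy_subset <- tibble(
  country = factor(c("Algeria", "Brazil", "China",
                     "Netherlands", "New Zealand", "Atlantis")),
  lifeExp = c(76.5, 75.0, 76.9, 81.6, 81.7, NA)
)

toy_continent <- toy_subset %>%
  mutate(continent = factor(
    case_when(
      country == "Algeria" ~ "Africa",
      country == "Brazil" ~ "Americas",
      country == "China" ~ "Asia",
      country == "Netherlands" ~ "Europe",
      country == "New Zealand" ~ "Oceania"
    ),
    # Setting the levels explicitly fixes their order and documents
    # which values are expected
    levels = c("Africa", "Americas", "Asia", "Europe", "Oceania")
  ))

# Unmatched rows end up as NA, which makes them easy to find
toy_continent %>%
  filter(is.na(continent))
```

Note that mutate() returns a new dataframe. To keep the continent variable, the result has to be assigned to an object, as done with toy_continent here.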